Memory-aware algorithms : from multicores to large scale platforms. (Algorithmes orientés mémoire : des processeurs multi-cœurs aux plates-formes à grande échelle)
Abstract
Throughout this thesis, we focus on designing memory-aware algorithms and schedules tailored for hierarchical memory architectures. Nowadays, these memory layouts can be found from within the heart of a multicore processor to the storage architectures of larger-scale platforms like supercomputers. Several platforms of various scales are studied in order to assess the impact of such memory architectures on performance.

We first study both the complexity and the performance of matrix product on multicore architectures. Indeed, dense linear algebra kernels are the key to performance for many scientific applications. We introduce a realistic but still tractable model of a multicore processor, and derive lower bounds on the communication volume. We adapt the matrix product algorithm to multicore architectures by taking caches into account, leading to three algorithms. Hence, the focus is set on minimizing cache misses. We assess both model relevance and performance of algorithms through an extensive set of experiments, ranging from simulation to real implementation on a GPU.

We then target a more complex operation: the QR factorization of rectangular matrices composed of p × q tiles, where p ≥ q. This dense linear algebra kernel lies at the foundation of many scientific applications. We thus revisit existing algorithms so as to better exploit the additional level of parallelism offered by multicore processors. Within this framework, we study the critical paths and performance of several algorithms and prove some of them to be asymptotically optimal. We conclude this study with an extensive set of experiments that show the superiority of the new algorithms for tall matrices.

In the next study, we focus on scheduling streaming applications onto a heterogeneous multicore platform, the QS 22. We experimentally evaluate communication performance within the QS 22 and introduce a model of the platform based upon these results. We then use steady-state scheduling techniques in order to maximize the throughput, that is, the number of instances processed per time unit. We present a mixed integer programming (MIP) approach that computes a mapping with optimal throughput, propose simpler heuristics, and experimentally assess the performance of all approaches on the QS 22.

We then focus on minimizing the amount of required memory for a given application. We study the traversal of tree-shaped workflows, which typically arise in sparse matrix factorization, and target a classical two-level memory system. In this context, I/O operations represent transfers from one memory to the other. We propose a new exact algorithm which is more efficient in practice than an existing optimal algorithm. We also show that there exist trees where commonly used postorder-based traversals require arbitrarily larger amounts of main memory than the optimal one. We then study the problem of minimizing the I/O volume for a given memory, and show that it is NP-hard, both for postorder-based and for arbitrary traversals. We provide a set of heuristics to solve this problem, and experimentally assess their performance on existing trees.

Finally, we compare archival policies for BLUE WATERS. Indeed, hierarchical memory architectures are also found in the storage systems of cutting-edge supercomputers, like BLUE WATERS. We hence introduce two archival policies tailored for the tape storage system of BLUE WATERS and adapt the well-known RAIT strategy. We provide an analytical model of the tape storage platform, and use it to assess and discuss the performance of the three policies through simulation. We show that RAIT is always outperformed by either VERTICAL or PARALLEL, and introduce the HETERO policy, which uses the two latter policies, bringing a ten-fold performance improvement.
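As an illustration of the tree-traversal part of the abstract, the following is a minimal sketch of how the peak memory of a postorder traversal might be computed. It is not the exact algorithm of the thesis. It assumes a simplified model in which each task only produces an output file, and processing a task requires the output files of all its children plus its own output to be in memory at once. The names `postorder_peak`, `children`, and `out_size` are hypothetical. Sorting the children by decreasing (subtree peak minus output size) is the classical rule for obtaining the best postorder under this kind of model.

```python
from typing import Dict, List


def postorder_peak(children: Dict[int, List[int]],
                   out_size: Dict[int, int],
                   root: int,
                   sort_children: bool = True) -> int:
    """Peak memory of a postorder traversal of the subtree rooted at `root`.

    Assumed model (a simplification): processing a node needs the output
    files of all its children plus its own output file in memory.
    If `sort_children` is False, children are visited in the given order.
    """
    def peak(node: int) -> int:
        # (subtree peak, output size) for each child subtree
        kids = [(peak(c), out_size[c]) for c in children.get(node, [])]
        if sort_children:
            # Visit first the subtrees with the largest (peak - output size).
            kids.sort(key=lambda pf: pf[0] - pf[1], reverse=True)
        held = 0   # outputs of already-processed children kept in memory
        worst = 0  # largest memory footprint seen so far
        for sub_peak, size in kids:
            worst = max(worst, held + sub_peak)
            held += size
        # Finally the node itself runs with all children outputs present.
        return max(worst, held + out_size[node])

    return peak(root)


# Small hypothetical tree: 0 is the root, 1 and 2 its children.
children = {0: [1, 2], 1: [3], 2: [4]}
out_size = {0: 2, 1: 1, 2: 4, 3: 9, 4: 1}
best = postorder_peak(children, out_size, 0)
bad_order = postorder_peak({0: [2, 1], 1: [3], 2: [4]}, out_size, 0,
                           sort_children=False)
print(best, bad_order)
```

On this toy tree the order in which the two subtrees are visited changes the peak, which is why the choice of traversal matters. The sketch only explores postorder traversals; as the abstract notes, the optimal traversal may not be a postorder at all.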
Similar resources
Multilevel communication optimal LU and QR factorizations for hierarchical platforms
This study focuses on the performance of two classical dense linear algebra algorithms, the LU and the QR factorizations, on multilevel hierarchical platforms. We first introduce a new model called Hierarchical Cluster Platform (HCP), encapsulating the characteristics of such platforms. The focus is set on reducing the communication requirements of studied algorithms at each level of the hierar...
The impact of cache misses on the performance of matrix product algorithms on multicore platforms
The multicore revolution is underway, bringing new chips introducing more complex memory architectures. Classical algorithms must be revisited in order to take the hierarchical memory layout into account. In this paper, we aim at designing cache-aware algorithms that minimize the number of cache misses paid during the execution of the matrix product kernel on a multicore processor. We analytica...
Memory-aware tree partitioning on homogeneous platforms
Scientific applications are commonly modeled as the processing of directed acyclic graphs of tasks, and for some of them, the graph takes the special form of a rooted tree. This tree expresses both the computational dependencies between tasks and their storage requirements. The problem of scheduling/traversing such a tree on a single processor to minimize its memory footprint has already been w...
Using group replication for resilience on exascale systems
High performance computing applications must be resilient to faults, which are common occurrences especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft studied research question is that of the o...
Apprentissage statistique à grande echelle
Many large-scale statistical learning problems are formulated as the optimization of a convex function of which only noisy gradients are observed: this function is typically the generalization error, and only the error on a single observation is available at each iteration. The algorithms used in practice give rise to convergence guarantees whose stu...